Assignment 2¶
Credits: Federico Ruggeri, Eleonora Mancini, Paolo Torroni
Keywords: Human Value Detection, Multi-label classification, Transformers, BERT
Contact¶
For any doubt, question, or issue, you can always contact us at the following email addresses:
Teaching Assistants:
- Federico Ruggeri -> federico.ruggeri6@unibo.it
- Eleonora Mancini -> e.mancini@unibo.it
Professor:
- Paolo Torroni -> p.torroni@unibo.it
Introduction¶
You are tasked to address the Human Value Detection challenge.
Problem definition¶
Arguments are paired with their conveyed human values.
Arguments are in the form of premise $\rightarrow$ conclusion.
Example:¶
Premise: ``fast food should be banned because it is really bad for your health and is costly''
Conclusion: ``We should ban fast food''
Stance: in favor of
0.1 Imports¶
Calling enable_custom_widget_manager() enables custom widget support in Colab notebooks. However, to use Plotly's FigureWidget we must downgrade Plotly for compatibility reasons.
import pandas as pd
import os, random
import urllib.request
from tqdm import tqdm
from IPython.display import display
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import plot
import plotly.offline as pyo
import numpy as np
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel, logging
def set_reproducibility(seed = 42) -> None:
random.seed(seed)
torch.manual_seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
set_reproducibility()
device = (
"cuda"
if torch.cuda.is_available()
else "cpu"
)
print(f"Using {device} device")
Using cuda device
[Task 1 - 0.5 points] Corpus¶
Check the official page of the challenge here.
The challenge offers several corpora for evaluation and testing.
You are going to work with the standard training, validation, and test splits.
Arguments¶
- arguments-training.tsv
- arguments-validation.tsv
- arguments-test.tsv
Human values¶
- labels-training.tsv
- labels-validation.tsv
- labels-test.tsv
Example¶
arguments-*.tsv¶
Argument ID A01005
Conclusion We should ban fast food
Stance in favor of
Premise fast food should be banned because it is really bad for your health and is costly.
labels-*.tsv¶
Argument ID A01005
Self-direction: thought 0
Self-direction: action 0
...
Universalism: objectivity 0
Splits¶
The standard splits contain
- Train: 5393 arguments
- Validation: 1896 arguments
- Test: 1576 arguments
Annotations¶
In this assignment, you are tasked to address a multi-label classification problem.
You are going to consider level 3 categories:
- Openness to change
- Self-enhancement
- Conservation
- Self-transcendence
How to do that?
You have to merge (logical OR) annotations of level 2 categories belonging to the same level 3 category.
Pay attention to shared level 2 categories (e.g., Hedonism). $\rightarrow$ see Table 1 in the original paper.
Example¶
Self-direction: thought: 0
Self-direction: action: 1
Stimulation: 0
Hedonism: 1
Openness to change: 1
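In pandas, the logical-OR merge for one level 3 category is a one-liner over its level 2 columns. A minimal sketch on a toy labels frame (column names follow the label files; the grouping assumes Table 1 of the original paper, where Hedonism is shared with Self-enhancement):

```python
import pandas as pd

# Toy labels frame with the level 2 columns mapped to "Openness to change".
labels = pd.DataFrame({
    "Self-direction: thought": [0, 1, 0],
    "Self-direction: action":  [1, 0, 0],
    "Stimulation":             [0, 0, 0],
    "Hedonism":                [1, 0, 0],
})

# Logical OR across the level 2 columns: 1 if any annotation is 1, else 0.
labels["Openness to change"] = labels.any(axis=1).astype(int)
print(labels["Openness to change"].tolist())  # -> [1, 1, 0]
```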
Instructions¶
- Download the specified training, validation, and test files.
- Encode split files into a pandas.DataFrame object.
- For each split, merge the arguments and labels dataframes into a single dataframe.
- Merge level 2 annotations to level 3 categories.
1.0 Variables and functions¶
files = [
"arguments-training.tsv",
"arguments-validation.tsv",
"arguments-test.tsv",
"labels-training.tsv",
"labels-validation.tsv",
"labels-test.tsv"
]
def showcase_dict_dataframes(dfs_dict: dict, n: int = 2) -> None:
"""
Prints information about DataFrames in a dictionary.
Args:
dfs_dict (dict): Dictionary containing DataFrames.
n (int): Numbers of rows to show. Default 2.
Returns:
None
"""
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
for name, dataframe in dfs_dict.items():
print(f"DataFrame Name: {name} | Shape: {dataframe.shape}")
display(dataframe.head(n))
print()
def configure_plotly_browser_state():
"""
Configures Plotly to display graphs in Colab.
"""
import IPython
display(IPython.core.display.HTML('''
<script src="/static/components/requirejs/require.js"></script>
<script>
requirejs.config({
paths: {
base: '/static/base',
plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
},
});
</script>
'''))
1.1 Download corpus¶
def download_corpus(files: list, path: str = "corpus") -> None:
"""
Downloads corpus files from Zenodo.
Args:
files (list): List of file names to download.
path (str, optional): Path to establish where to download the files. Defaults to "corpus".
Returns:
None
"""
print("- Starting download...\n")
if not os.path.exists(path):
os.makedirs(path)
for file_name in files:
file_path = os.path.join(path, file_name)
if os.path.exists(file_path):
print(f"\t@ {file_name} already exists. Skipping download.\n")
else:
download_link = f"https://zenodo.org/records/10564870/files/{file_name}?download=1"
print(f"\t@ Downloading {file_name}...")
urllib.request.urlretrieve(download_link, file_path)
print(f"\t@ {file_name} downloaded successfully!\n")
print("- All files downloaded successfully.")
download_corpus(files)
- Starting download...
	@ arguments-training.tsv already exists. Skipping download.
	@ arguments-validation.tsv already exists. Skipping download.
	@ arguments-test.tsv already exists. Skipping download.
	@ labels-training.tsv already exists. Skipping download.
	@ labels-validation.tsv already exists. Skipping download.
	@ labels-test.tsv already exists. Skipping download.
- All files downloaded successfully.
1.2 Encode into a dataframe¶
def files_to_dataframe(files: list, path: str = "corpus") -> dict:
"""
Reads multiple files into DataFrames and returns a dictionary.
Args:
files (list): List of file names to be read.
path (str, optional): Path to the directory containing the files. Defaults to "corpus".
Returns:
dict: Dictionary containing DataFrames, where keys are file names and values are DataFrames.
"""
dfs_dict = {}
for file in files:
file_path = os.path.join(path, file)
dfs_dict[file] = pd.read_csv(file_path, sep='\t', header=0)
return dfs_dict
dfs_dict_1 = files_to_dataframe(files)
showcase_dict_dataframes(dfs_dict_1)
DataFrame Name: arguments-training.tsv | Shape: (5393, 4)
| Argument ID | Conclusion | Stance | Premise | |
|---|---|---|---|---|
| 0 | A01002 | We should ban human cloning | in favor of | we should ban human cloning as it will only cause huge issues when you have a bunch of the same humans running around all acting the same. |
| 1 | A01005 | We should ban fast food | in favor of | fast food should be banned because it is really bad for your health and is costly. |
DataFrame Name: arguments-validation.tsv | Shape: (1896, 4)
| Argument ID | Conclusion | Stance | Premise | |
|---|---|---|---|---|
| 0 | A01001 | Entrapment should be legalized | in favor of | if entrapment can serve to more easily capture wanted criminals, then why shouldn't it be legal? |
| 1 | A01012 | The use of public defenders should be mandatory | in favor of | the use of public defenders should be mandatory because some people don't have money for a lawyer and this would help those that don't |
DataFrame Name: arguments-test.tsv | Shape: (1576, 4)
| Argument ID | Conclusion | Stance | Premise | |
|---|---|---|---|---|
| 0 | A26004 | We should end affirmative action | against | affirmative action helps with employment equity. |
| 1 | A26010 | We should end affirmative action | in favor of | affirmative action can be considered discriminatory against poor whites |
DataFrame Name: labels-training.tsv | Shape: (5393, 21)
| Argument ID | Self-direction: thought | Self-direction: action | Stimulation | Hedonism | Achievement | Power: dominance | Power: resources | Face | Security: personal | Security: societal | Tradition | Conformity: rules | Conformity: interpersonal | Humility | Benevolence: caring | Benevolence: dependability | Universalism: concern | Universalism: nature | Universalism: tolerance | Universalism: objectivity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A01002 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | A01005 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DataFrame Name: labels-validation.tsv | Shape: (1896, 21)
| Argument ID | Self-direction: thought | Self-direction: action | Stimulation | Hedonism | Achievement | Power: dominance | Power: resources | Face | Security: personal | Security: societal | Tradition | Conformity: rules | Conformity: interpersonal | Humility | Benevolence: caring | Benevolence: dependability | Universalism: concern | Universalism: nature | Universalism: tolerance | Universalism: objectivity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A01001 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | A01012 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
DataFrame Name: labels-test.tsv | Shape: (1576, 21)
| Argument ID | Self-direction: thought | Self-direction: action | Stimulation | Hedonism | Achievement | Power: dominance | Power: resources | Face | Security: personal | Security: societal | Tradition | Conformity: rules | Conformity: interpersonal | Humility | Benevolence: caring | Benevolence: dependability | Universalism: concern | Universalism: nature | Universalism: tolerance | Universalism: objectivity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A26004 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | A26010 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
1.3 Merge arguments and labels¶
def merge_arguments_labels(dfs_dict: dict) -> dict:
"""
Merges arguments and labels DataFrames for each split and returns a new dictionary.
Args:
dfs_dict (dict): Dictionary containing DataFrames.
Returns:
dict: Dictionary containing merged DataFrames, where keys are split names and values are merged DataFrames.
"""
merged_dfs_dict = {}
for split in ["training", "validation", "test"]:
merged_dfs_dict[split] = pd.merge(dfs_dict[f"arguments-{split}.tsv"], dfs_dict[ f"labels-{split}.tsv"], on="Argument ID")
return merged_dfs_dict
dfs_dict_2 = merge_arguments_labels(dfs_dict_1)
showcase_dict_dataframes(dfs_dict_2)
DataFrame Name: training | Shape: (5393, 24)
| Argument ID | Conclusion | Stance | Premise | Self-direction: thought | Self-direction: action | Stimulation | Hedonism | Achievement | Power: dominance | Power: resources | Face | Security: personal | Security: societal | Tradition | Conformity: rules | Conformity: interpersonal | Humility | Benevolence: caring | Benevolence: dependability | Universalism: concern | Universalism: nature | Universalism: tolerance | Universalism: objectivity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A01002 | We should ban human cloning | in favor of | we should ban human cloning as it will only cause huge issues when you have a bunch of the same humans running around all acting the same. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | A01005 | We should ban fast food | in favor of | fast food should be banned because it is really bad for your health and is costly. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DataFrame Name: validation | Shape: (1896, 24)
| Argument ID | Conclusion | Stance | Premise | Self-direction: thought | Self-direction: action | Stimulation | Hedonism | Achievement | Power: dominance | Power: resources | Face | Security: personal | Security: societal | Tradition | Conformity: rules | Conformity: interpersonal | Humility | Benevolence: caring | Benevolence: dependability | Universalism: concern | Universalism: nature | Universalism: tolerance | Universalism: objectivity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A01001 | Entrapment should be legalized | in favor of | if entrapment can serve to more easily capture wanted criminals, then why shouldn't it be legal? | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | A01012 | The use of public defenders should be mandatory | in favor of | the use of public defenders should be mandatory because some people don't have money for a lawyer and this would help those that don't | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
DataFrame Name: test | Shape: (1576, 24)
| Argument ID | Conclusion | Stance | Premise | Self-direction: thought | Self-direction: action | Stimulation | Hedonism | Achievement | Power: dominance | Power: resources | Face | Security: personal | Security: societal | Tradition | Conformity: rules | Conformity: interpersonal | Humility | Benevolence: caring | Benevolence: dependability | Universalism: concern | Universalism: nature | Universalism: tolerance | Universalism: objectivity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A26004 | We should end affirmative action | against | affirmative action helps with employment equity. | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | A26010 | We should end affirmative action | in favor of | affirmative action can be considered discriminatory against poor whites | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
1.4 Merge level 2 annotations to level 3 categories¶
def merge_subcategories(dfs_dict: dict, level_2_to_level_3: dict) -> dict:
"""
Performs the aggregation of level 2 categories into level 3 categories for each DataFrame in the input dictionary.
Args:
dfs_dict (dict): Dictionary containing DataFrames.
level_2_to_level_3 (dict): Dictionary containing the starting and ending column indices for each level 3 category.
Returns:
dict: Dictionary containing DataFrames with level 3 categories.
"""
level3_dfs_dict = {}
first_column, last_column = list(level_2_to_level_3.values())[0][0], list(level_2_to_level_3.values())[-1][1]
for name, df in dfs_dict.items():
new_df = df.copy()
for level_3_category, (start_column, end_column) in level_2_to_level_3.items():
new_df[level_3_category] = new_df.loc[:, start_column:end_column].apply(
lambda row: 1 if row.any() else 0, axis=1
)
new_df.drop(df.loc[:, first_column:last_column].columns, axis=1, inplace=True)
level3_dfs_dict[name] = new_df
return level3_dfs_dict
# Non-unique mapping of level 2 column ranges to level 3 macro-categories
# (boundary columns such as Hedonism, Face, and Humility are shared between adjacent macro-categories)
level_2_to_level_3 = {
"Openness to change": ["Self-direction: thought", "Hedonism"],
"Self-enhancement": ["Hedonism", "Face"],
"Conservation": ["Face", "Humility"],
"Self-transcendence": ["Humility", "Universalism: objectivity"]
}
dfs_dict_3 = merge_subcategories(dfs_dict_2, level_2_to_level_3)
showcase_dict_dataframes(dfs_dict_3)
DataFrame Name: training | Shape: (5393, 8)
| Argument ID | Conclusion | Stance | Premise | Openness to change | Self-enhancement | Conservation | Self-transcendence | |
|---|---|---|---|---|---|---|---|---|
| 0 | A01002 | We should ban human cloning | in favor of | we should ban human cloning as it will only cause huge issues when you have a bunch of the same humans running around all acting the same. | 0 | 0 | 1 | 0 |
| 1 | A01005 | We should ban fast food | in favor of | fast food should be banned because it is really bad for your health and is costly. | 0 | 0 | 1 | 0 |
DataFrame Name: validation | Shape: (1896, 8)
| Argument ID | Conclusion | Stance | Premise | Openness to change | Self-enhancement | Conservation | Self-transcendence | |
|---|---|---|---|---|---|---|---|---|
| 0 | A01001 | Entrapment should be legalized | in favor of | if entrapment can serve to more easily capture wanted criminals, then why shouldn't it be legal? | 0 | 0 | 1 | 0 |
| 1 | A01012 | The use of public defenders should be mandatory | in favor of | the use of public defenders should be mandatory because some people don't have money for a lawyer and this would help those that don't | 0 | 0 | 0 | 1 |
DataFrame Name: test | Shape: (1576, 8)
| Argument ID | Conclusion | Stance | Premise | Openness to change | Self-enhancement | Conservation | Self-transcendence | |
|---|---|---|---|---|---|---|---|---|
| 0 | A26004 | We should end affirmative action | against | affirmative action helps with employment equity. | 0 | 1 | 1 | 1 |
| 1 | A26010 | We should end affirmative action | in favor of | affirmative action can be considered discriminatory against poor whites | 0 | 1 | 0 | 1 |
1.5 Data analysis¶
1.5.1 Level three categories¶
def pie_plot(dfs_dict: dict, keys: list, title: str) -> None:
"""
Plots the distribution of a set of keys for each split. Used to visualize keys distributions.
Args:
dfs_dict (dict): Dictionary containing DataFrames.
keys (list): List of keys to plot.
title (str): Title of the plot.
Returns:
None
"""
subplots = make_subplots(rows=1, cols=len(dfs_dict), specs=[[{"type": "pie"}] * len(dfs_dict)],
subplot_titles=list(dfs_dict.keys()))
fig = go.FigureWidget(subplots)
for i, (name, df) in enumerate(dfs_dict.items()):
fig.add_trace(go.Pie(labels=keys, values=[df[key].sum() for key in keys],
marker=dict(colors=px.colors.qualitative.Set1), hole=.3),
row=1, col=i + 1)
fig.update_layout(title_text=title, legend=dict(traceorder='reversed'))
fig.show()
def bar_plot(dfs_dict: dict, keys: list, title: str) -> None:
"""
Plots the distribution of a set of keys for each split. Used to visualize values distributions.
Args:
dfs_dict (dict): Dictionary containing DataFrames.
keys (list): List of keys to plot.
title (str): Title of the plot.
Returns:
None
"""
subplots = make_subplots(rows=1, cols=len(dfs_dict), subplot_titles=list(dfs_dict.keys()))
fig = go.FigureWidget(subplots)
for i, (name, df) in enumerate(dfs_dict.items()):
values_list = [df[key].value_counts().sort_index() for key in keys]
x_labels = list(values_list[0].index)
for j, values in enumerate(values_list):
fig.add_trace(go.Bar(x=x_labels, y=values.values, name=keys[j],
showlegend=False if keys[j] in [k['name'] for k in fig.data[:]] else True,
marker_color=px.colors.qualitative.Set1[j]),
row=1, col=i + 1)
fig.update_layout(title_text=title, barmode='group')
fig.show()
def heatmap_plot(dfs_dict: dict, keys: list, group_column: str, title: str, colorscale: str, rows: bool = False) -> None:
"""
Plots a heatmap showing the occurrence of pairs of values in DataFrames, along with the percentages.
Args:
dfs_dict (dict): Dictionary containing DataFrames.
keys (list): List of keys to plot.
group_column (str): Column to group by for counting occurrences.
title (str): Title of the plot.
colorscale (str): The colorscale for the heatmap.
rows (bool): If True, arrange subplots in rows; otherwise, arrange in columns. Defaults to False.
Returns:
None
"""
n_rows, n_cols = (len(dfs_dict), 1) if rows else (1, len(dfs_dict))
subplots = make_subplots(rows=n_rows, cols=n_cols, subplot_titles=list(dfs_dict.keys()))
fig = go.FigureWidget(subplots)
for i, (name, df) in enumerate(dfs_dict.items()):
if group_column:
df_new = df.groupby(group_column).sum()[keys]
df_new = df_new.div(df_new.sum(axis=0), axis=1)
x_values = df_new.columns.tolist()
y_values = df_new.index.tolist()
else:
# Create an empty co-occurrence matrix between the values in keys
co_occurrence_matrix = df[keys].T.dot(df[keys])
np.fill_diagonal(co_occurrence_matrix.values, 0) # Exclude diagonal values
df_new = co_occurrence_matrix.div(co_occurrence_matrix.sum(axis=0), axis=1)
x_values = keys
y_values = keys
values = [[f'{value:.3f}' for value in row] for row in df_new.values.tolist()]
fig.add_trace(go.Heatmap(z=df_new.values.tolist(),
x=x_values,
y=y_values,
colorscale=colorscale,
zmin=0,
zmax=1,
showlegend=False,
text=values,
texttemplate="%{text}",
textfont={"size": 12}),
row=i // n_cols + 1, col=i % n_cols + 1)
fig.update_layout(title_text=title)
fig.show()
level_3_labels = list(dfs_dict_3['training'].columns[4:])
pie_plot(dfs_dict_3, level_3_labels, "Level 3 Categories human values distribution")
bar_plot(dfs_dict_3, level_3_labels, "Level 3 Categories human values distribution")
The dataset appears to be imbalanced: arguments commonly concern human values such as Conservation and Self-transcendence, while Openness to change and Self-enhancement are far less represented. This imbalance is consistent across the training, validation, and test splits.
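The imbalance can also be quantified directly; in the notebook, `dfs_dict_3['training'][level_3_labels].mean()` gives the per-category positive rates. A minimal self-contained sketch of the same pattern on toy data:

```python
import pandas as pd

# Toy split with the four level 3 categories (values are illustrative);
# in the notebook one would use dfs_dict_3['training'][level_3_labels] instead.
df = pd.DataFrame({
    "Openness to change": [0, 0, 1, 0],
    "Self-enhancement":   [0, 1, 0, 0],
    "Conservation":       [1, 1, 1, 0],
    "Self-transcendence": [1, 0, 1, 1],
})

# Fraction of arguments labelled positive for each category.
rates = df.mean()
print(rates.idxmax())  # most frequent category in this toy split
```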
1.5.2 Co-occurrences¶
heatmap_plot(dfs_dict=dfs_dict_3, keys=level_3_labels, group_column='Stance', title="Stance and human values co-occurrence", colorscale = 'RdBu', rows=True)
The distribution of different Stance values is balanced throughout the datasets, meaning that people tend to have different perspectives on arguments. This suggests that considering this field in our models' input may not improve their discrimination capability.
heatmap_plot(dfs_dict=dfs_dict_3, keys=level_3_labels, group_column=None, title="Values co-occurrence", colorscale = 'Aggrnyl',rows=True)
Conservation and Self-transcendence co-occur often, even though Humility, their shared level 2 category, is rarely observed in the datasets. This suggests that there could be a hidden dependence in the samples.
1.6 Data pre-processing¶
def preprocess_dataframes(dfs_dict: dict) -> dict:
"""
Preprocesses each dataframe in the given dictionary and returns a new dictionary with preprocessed dataframes.
Args:
dfs_dict (dict): Dictionary containing DataFrames.
Returns:
dict: Dictionary containing preprocessed DataFrames.
"""
preprocessed_dfs = {}
for name, df in dfs_dict.items():
preprocessed_df = df.copy()
preprocessed_df["Stance"] = preprocessed_df["Stance"].replace({'in favor of': 1, 'against': 0})
preprocessed_dfs[name] = preprocessed_df
return preprocessed_dfs
dfs_dict_4 = preprocess_dataframes(dfs_dict_3)
showcase_dict_dataframes(dfs_dict_4)
DataFrame Name: training | Shape: (5393, 8)
| Argument ID | Conclusion | Stance | Premise | Openness to change | Self-enhancement | Conservation | Self-transcendence | |
|---|---|---|---|---|---|---|---|---|
| 0 | A01002 | We should ban human cloning | 1 | we should ban human cloning as it will only cause huge issues when you have a bunch of the same humans running around all acting the same. | 0 | 0 | 1 | 0 |
| 1 | A01005 | We should ban fast food | 1 | fast food should be banned because it is really bad for your health and is costly. | 0 | 0 | 1 | 0 |
DataFrame Name: validation | Shape: (1896, 8)
| Argument ID | Conclusion | Stance | Premise | Openness to change | Self-enhancement | Conservation | Self-transcendence | |
|---|---|---|---|---|---|---|---|---|
| 0 | A01001 | Entrapment should be legalized | 1 | if entrapment can serve to more easily capture wanted criminals, then why shouldn't it be legal? | 0 | 0 | 1 | 0 |
| 1 | A01012 | The use of public defenders should be mandatory | 1 | the use of public defenders should be mandatory because some people don't have money for a lawyer and this would help those that don't | 0 | 0 | 0 | 1 |
DataFrame Name: test | Shape: (1576, 8)
| Argument ID | Conclusion | Stance | Premise | Openness to change | Self-enhancement | Conservation | Self-transcendence | |
|---|---|---|---|---|---|---|---|---|
| 0 | A26004 | We should end affirmative action | 0 | affirmative action helps with employment equity. | 0 | 1 | 1 | 1 |
| 1 | A26010 | We should end affirmative action | 1 | affirmative action can be considered discriminatory against poor whites | 0 | 1 | 0 | 1 |
[Task 2 - 2.0 points] Model definition¶
You are tasked to define several neural models for multi-label classification.
Instructions¶
- Baseline: implement a random uniform classifier (an individual classifier per category).
- Baseline: implement a majority classifier (an individual classifier per category).
- BERT w/ C: define a BERT-based classifier that receives an argument conclusion as input.
- BERT w/ CP: add argument premise as an additional input.
- BERT w/ CPS: add argument premise-to-conclusion stance as an additional input.
Notes¶
Do not mix models. Each model has its own instructions.
You are free to select the BERT-based model card from huggingface.
Examples¶
bert-base-uncased
prajjwal1/bert-tiny
distilbert-base-uncased
roberta-base
model_name = 'prajjwal1/bert-tiny'
tokenizer = AutoTokenizer.from_pretrained(model_name)
PyTorch lets us create a custom Dataset for our files by subclassing Dataset and implementing three methods: __init__, __len__, and __getitem__.
class HumanValuesDataset(Dataset):
def __init__(self, dataframe: pd.DataFrame, tokenizer, text_columns: dict, numerical_columns: list, label_columns: list):
self.data = dataframe
self.tokenizer = tokenizer
self.text_columns = text_columns
self.numerical_columns = numerical_columns
self.label_columns = label_columns
def __len__(self) -> int:
return len(self.data)
def __getitem__(self, idx: int) -> dict:
row = {}
labels = torch.tensor(self.data.iloc[idx][self.label_columns].values.astype(np.float32))
# Text
for col, max_length in self.text_columns.items():
text = str(self.data.iloc[idx][col])
encoding = self.tokenizer(text,
truncation=True,
max_length=max_length,
padding='max_length',
return_tensors='pt')
row[col.lower()] = {
'input_ids': encoding['input_ids'].squeeze(0),
'attention_mask': encoding['attention_mask'].squeeze(0),
'token_type_ids': encoding["token_type_ids"].squeeze(0)
}
# Numerical
for col in self.numerical_columns:
row[col.lower()] = torch.tensor([self.data.iloc[idx][col]], dtype=torch.float32)  # positional indexing, as for the labels; float so it can be concatenated with hidden states
row['labels'] = labels
return row
Finally, we load the datasets using the DataLoader class, which iterates through a dataset in batches of batch_size samples, returning the features and labels for each batch. Because we specify shuffle=True, the data is reshuffled at the start of every epoch.
From the Hugging Face model card we know that the maximum sequence length our model can handle is 512. Note that the lengths computed below are character counts, which give a conservative upper bound on the number of tokens.
numerical_columns = ['Stance']
text_columns_names = ['Conclusion', 'Premise']
text_columns = {col:0 for col in text_columns_names}
for col in text_columns.keys():
idx = dfs_dict_4['training'][col].str.len().idxmax()
txt = dfs_dict_4['training'][col][idx]
text_columns[col] = len(txt)
if text_columns[col] < 512:
print(f"Longest '{col}' in training split:\n\t'{txt}'\n\tLength:{text_columns[col]}\n")
else:
print(f"Longest '{col}' in training split:\n\t'{txt}'\n\tLength:{text_columns[col]} (Truncated to 512)\n")
text_columns[col] = 512
Longest 'Conclusion' in training split:
	'The best way to save the world from climate change and protect the environment is to encourage everyone to start with themselves and look at what things they can do to help with this problem'
	Length:190
Longest 'Premise' in training split:
	'According to the United Nations Convention on the rights of people with disabilities, the European Union “shall closely consult with and actively involve persons with disabilities” on political decisions that concern them. Meanwhile the “European Strategy for the Rights of Persons with Disabilities 2021-2030” hardly mentioned people with intellectual and neurological disabilities. Now the Conference for the Future of Europe wants to be an inclusive citizen consultation but is still not making it a priority to include citizens with Trisomy 21, autism, or members of other neurodivergent communities. Ableism keeps getting perpetuated in the EU and it needs to stop. We want more representation! We want people with Down Syndrome to be fully included in the European Union institutions!'
	Length:792 (Truncated to 512)
train_dataset = HumanValuesDataset(dfs_dict_4['training'], tokenizer, text_columns=text_columns,\
numerical_columns=numerical_columns,\
label_columns=level_3_labels)
val_dataset = HumanValuesDataset(dfs_dict_4['validation'], tokenizer, text_columns=text_columns,\
numerical_columns=numerical_columns,\
label_columns=level_3_labels)
test_dataset = HumanValuesDataset(dfs_dict_4['test'], tokenizer, text_columns=text_columns,\
numerical_columns=numerical_columns,\
label_columns=level_3_labels)
def seed_worker(worker_id):
worker_seed = torch.initial_seed() % 2**32
np.random.seed(worker_seed)
random.seed(worker_seed)
batch_size = 32
g = torch.Generator()
g.manual_seed(42)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, worker_init_fn=seed_worker, generator=g)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True, worker_init_fn=seed_worker, generator=g)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, worker_init_fn=seed_worker, generator=g)
Let us explore a sample to ensure the correctness of our methodology.
x = next(iter(train_loader))
for item, item_value in x.items():
print(f"{item}:")
try:
for el, value in item_value.items():
print(f"\t{el} [shape: {value.shape}]")
except:
print(f"\tshape: {item_value.shape}")
print()
conclusion:
	input_ids [shape: torch.Size([32, 190])]
	attention_mask [shape: torch.Size([32, 190])]
	token_type_ids [shape: torch.Size([32, 190])]
premise:
	input_ids [shape: torch.Size([32, 512])]
	attention_mask [shape: torch.Size([32, 512])]
	token_type_ids [shape: torch.Size([32, 512])]
stance:
	shape: torch.Size([32, 1])
labels:
	shape: torch.Size([32, 4])
2.2 Baselines¶
We now define the two requested baselines: a random uniform classifier and a majority classifier.
class RandomUniformClassifier(nn.Module):
def __init__(self, num_labels):
"""
Initializes a new instance of the RandomUniformClassifier class.
Args:
num_labels (int): The number of labels/classes in the classification task.
Returns:
None
"""
super(RandomUniformClassifier, self).__init__()
self.num_labels = num_labels
def forward(self, x):
"""
Defines the forward pass of the random uniform classifier.
Args:
x (torch.Tensor): Input tensor representing the features.
Returns:
torch.Tensor: A tensor representing the randomly generated predictions for each sample in the input batch.
"""
return torch.randint(0, 2, (x['conclusion']['input_ids'].shape[0], self.num_labels)).float()
class MajorityClassifier(nn.Module):
def __init__(self):
"""
Initializes a new instance of the MajorityClassifier class.
Returns:
None
"""
super(MajorityClassifier, self).__init__()
def forward(self, x):
"""
Defines the forward pass of the majority classifier.
Args:
x (torch.Tensor): Input tensor representing the features.
Returns:
torch.Tensor: A tensor representing the majority labels repeated for each sample in the batch.
"""
return self.results.repeat(x['conclusion']['input_ids'].size(0), 1)
def fit(self, dataset):
"""
Fits the majority classifier to the given dataset.
Args:
dataset (torch.utils.data.Dataset): The dataset containing labeled samples.
Returns:
None
"""
labels = torch.stack([sample['labels'] for sample in dataset])
self.results = (torch.count_nonzero(labels, dim=0) > len(dataset) / 2).float()
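As a self-contained illustration of the majority vote computed in fit (toy data, two categories): a label is predicted as 1 only when more than half of the training samples carry it.

```python
import torch

# Toy multi-label dataset: 3 samples, 2 categories.
dataset = [{"labels": torch.tensor(v)} for v in
           ([1., 0.], [1., 1.], [0., 0.])]

# Per-category majority vote, as in MajorityClassifier.fit:
labels = torch.stack([s["labels"] for s in dataset])
majority = (torch.count_nonzero(labels, dim=0) > len(dataset) / 2).float()
print(majority.tolist())  # -> [1.0, 0.0]
```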
2.3 BERT models¶
We now define the BERT-based models required by the task.
Input concatenation¶
logging.set_verbosity_error()
class BERTModule(nn.Module):
def __init__(self, model_name, num_labels):
super(BERTModule, self).__init__()
self.model = AutoModel.from_pretrained(model_name)
def forward(self, x):
return self.model(input_ids=x['input_ids'], \
attention_mask=x['attention_mask'], \
token_type_ids=x['token_type_ids']).pooler_output
class ClassificationHead(nn.Module):
def __init__(self, input_size, num_labels):
super(ClassificationHead, self).__init__()
self.dropout = torch.nn.Dropout(p=0.2)
self.fc = nn.Linear(input_size, num_labels)
self.sigmoid = nn.Sigmoid()
def forward(self, x):
x = self.dropout(x)
x = self.fc(x)
x = self.sigmoid(x)
return x
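A side note on the final Sigmoid: since training will use nn.BCELoss, the head must output probabilities. An equivalent and more numerically stable pairing, sketched below on toy values, drops the Sigmoid and feeds raw logits to nn.BCEWithLogitsLoss; we keep the explicit Sigmoid in our models for clarity.

```python
import torch
from torch import nn

logits = torch.tensor([[2.0, -1.0], [0.5, 0.0]])
targets = torch.tensor([[1.0, 0.0], [0.0, 1.0]])

# Probabilities + BCELoss (what ClassificationHead plus our criterion compute)
loss_a = nn.BCELoss()(torch.sigmoid(logits), targets)
# Raw logits + BCEWithLogitsLoss (fused formulation, safer for extreme logits)
loss_b = nn.BCEWithLogitsLoss()(logits, targets)
print(loss_a.item(), loss_b.item())  # identical up to floating-point error
```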
BERT w/ C¶
class CClassifier(nn.Module):
def __init__(self, model_name, num_labels):
super(CClassifier, self).__init__()
self.conclusion_module = BERTModule(model_name, num_labels)
size = self.conclusion_module.model.config.hidden_size
self.head = ClassificationHead(size, num_labels)
def forward(self, x):
h = self.conclusion_module(x['conclusion'])
y = self.head(h)
return y
BERT w/ CP¶
class CPClassifier(nn.Module):
def __init__(self, model_name, num_labels):
super(CPClassifier, self).__init__()
self.conclusion_module = BERTModule(model_name, num_labels)
self.premise_module = BERTModule(model_name, num_labels)
size = self.premise_module.model.config.hidden_size + \
self.conclusion_module.model.config.hidden_size
self.head = ClassificationHead(size, num_labels)
def forward(self, x):
h_1 = self.conclusion_module(x['conclusion'])
h_2 = self.premise_module(x['premise'])
y = self.head(torch.cat((h_1, h_2), dim=-1))
return y
BERT w/ CPS¶
class CPSClassifier(nn.Module):
def __init__(self, model_name, num_labels):
super(CPSClassifier, self).__init__()
self.conclusion_module = BERTModule(model_name, num_labels)
self.premise_module = BERTModule(model_name, num_labels)
size = self.premise_module.model.config.hidden_size + \
self.conclusion_module.model.config.hidden_size + 1
self.head = ClassificationHead(size, num_labels)
def forward(self, x):
h_1 = self.conclusion_module(x['conclusion'])
h_2 = self.premise_module(x['premise'])
y = self.head(torch.cat((h_1, h_2, x['stance']), dim=-1))
return y
Notes¶
The stance input has to be encoded into a numerical format.
You should use the same model instance to encode premise and conclusion inputs.
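For the first note, a minimal stance encoding could look like the following sketch; the textual values and the (N, 1) column shape are our assumptions, chosen so the result can be concatenated to the pooled embeddings the way x['stance'] is used in CPSClassifier.

```python
import torch

def encode_stance(stances):
    """Map textual stance values to a float column tensor of shape (N, 1)."""
    mapping = {'in favor of': 1.0, 'in favour of': 1.0, 'against': 0.0}
    return torch.tensor([[mapping[s]] for s in stances])

stance = encode_stance(['in favour of', 'against'])
print(stance.shape)  # torch.Size([2, 1]), ready for torch.cat(..., dim=-1)
```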
[Task 3 - 0.5 points] Metrics¶
Before training the models, you are tasked to define the evaluation metrics for comparison.
Instructions¶
- Evaluate your models using per-category binary F1-score.
- Compute the average binary F1-score over all categories (macro F1-score).
Example¶
You start with individual predictions ($\rightarrow$ samples).
Openness to change: 0 0 1 0 1 1 0 ...
Self-enhancement: 1 0 0 0 1 0 1 ...
Conversation: 0 0 0 1 1 0 1 ...
Self-transcendence: 1 1 0 1 0 1 0 ...
You compute per-category binary F1-score.
Openness to change F1: 0.35
Self-enhancement F1: 0.55
Conversation F1: 0.80
Self-transcendence F1: 0.21
You then average per-category scores.
Average F1: ~0.48
3.1 Metrics¶
We opted to use sklearn.metrics.f1_score.
Its average parameter accepts the value 'binary', which computes the score for the class specified by pos_label; this allows us to compute the per-category binary F1-score.
The 'macro' option, instead, calculates the metric for each label and takes their unweighted mean. This does not take label imbalance into account.
from sklearn.metrics import f1_score
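To make the two averaging modes concrete, here is a toy multi-label example: average=None returns one binary F1 per category, and average='macro' is their unweighted mean.

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]])
y_pred = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0]])

per_category = f1_score(y_true, y_pred, average=None)  # one binary F1 per column
macro = f1_score(y_true, y_pred, average='macro')      # unweighted mean of the above
print(per_category, macro)
```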
Our data analysis revealed a highly imbalanced dataset. While various techniques address this issue, calculating class weights remains a prevalent and effective approach.
By emphasizing the error (loss values) for under-represented classes, we encourage the model to focus on learning these classes effectively. This is achieved by computing class weights based on the inverse frequency of each class.
Given that the loss function $\mathcal{L}$ (a class-weighted binary cross-entropy) is defined as:
$$ \mathcal{L} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{K} w_i \left[ t_n^i \log(y_n^i) + (1 - t_n^i) \log(1 - y_n^i) \right] $$where:
- $N$ is the total number of samples.
- $K$ is the total number of classes.
- $w_i$ is the weight for class $i$.
- $t_n^i$ is the true label of sample $n$ for class $i$ (either 0 or 1).
- $y_n^i$ is the predicted probability of sample $n$ belonging to class $i$.
Classes with higher weights contribute more significantly to the overall loss. To counteract the model's bias towards frequent classes, we calculate class weights inversely proportional to the class frequencies:
$$ w_i = \frac{N}{K \sum_{n=1}^{N} t_n^i} $$This formula ensures that classes with fewer samples receive higher weights, balancing the influence of each class on the training process.
def compute_weights(dataset):
"""
Compute class weights based on the inverse of class frequencies in the dataset.
Args:
dataset (Dataset): The dataset containing samples with 'labels' attribute.
Returns:
torch.Tensor: A tensor containing the computed class weights.
"""
n_samples = len(dataset) # Total number of samples
n_classes = len(dataset[0]['labels']) # Total number of classes
# Count the number of samples for each class
n_samples_per_class = torch.count_nonzero(dataset[:]['labels'], dim=0)
# Compute weights
weights = torch.stack([n_samples / (n_classes * n_samples_j) for n_samples_j in n_samples_per_class])
return weights
class_weights = compute_weights(train_dataset)
print("Weights:", class_weights)
Weights: tensor([0.6813, 0.5417, 0.3283, 0.3284])
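A quick sanity check of the inverse-frequency formula on toy counts (not the actual dataset's): multiplying each weight by its class count yields the same constant N/K for every class, so rarer classes receive proportionally larger weights.

```python
import torch

n_samples, n_classes = 100, 4
counts = torch.tensor([10.0, 20.0, 30.0, 40.0])  # toy positive counts per class

weights = n_samples / (n_classes * counts)       # w_i = N / (K * count_i)
print(weights)                                   # largest weight for the rarest class
print(weights * counts)                          # each product equals N / K = 25
```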
[Task 4 - 1.0 points] Training and Evaluation¶
You are now tasked to train and evaluate all defined models.
Instructions¶
- Train all models on the train set.
- Evaluate all models on the validation set.
- Pick at least three seeds for robust estimation.
- Compute metrics on the validation set.
- Report per-category and macro F1-score for comparison.
def move_to_device(data, device: str):
"""
Move data to the specified device.
Args:
data (Union[dict, torch.Tensor, Any]): Input data to be moved to the device.
device (str): The device to move the data to (e.g., 'cuda' or 'cpu').
Returns:
Union[dict, torch.Tensor, Any]: Data moved to the specified device.
"""
if isinstance(data, dict):
return {key: move_to_device(value, device) for key, value in data.items()}
elif isinstance(data, torch.Tensor):
return data.to(device)
else:
return data
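Since our batches nest tokenized fields under 'conclusion' and 'premise', the recursion matters; here is a quick CPU-only demonstration (the helper is repeated so the snippet is self-contained):

```python
import torch

def move_to_device(data, device):
    """Recursively move tensors inside (possibly nested) dicts to a device."""
    if isinstance(data, dict):
        return {k: move_to_device(v, device) for k, v in data.items()}
    if isinstance(data, torch.Tensor):
        return data.to(device)
    return data  # non-tensor values (strings, ints, ...) pass through unchanged

batch = {
    'conclusion': {'input_ids': torch.zeros(2, 8, dtype=torch.long)},
    'labels': torch.ones(2, 4),
    'ids': ['a1', 'a2'],
}
moved = move_to_device(batch, 'cpu')
print(moved['conclusion']['input_ids'].device)  # cpu
```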
def evaluate(model: nn.Module, loader: DataLoader, criterion: nn.Module, device: str, seed: int = 42):
"""
Evaluate the model on the given loader using the specified criterion.
Args:
model (nn.Module): The model to evaluate.
loader (DataLoader): The data loader for evaluation.
criterion (nn.Module): The criterion used for evaluation.
device (str): The device to run the evaluation on (e.g., 'cuda' or 'cpu').
seed (int): Random seed for reproducibility.
Returns:
Tuple[float, float, np.ndarray, List[np.ndarray], List[np.ndarray], List[np.ndarray]]: A tuple containing the loss, macro F1 score, per-category F1 scores, ground truth labels, predicted labels, and prediction scores.
"""
set_reproducibility(seed)
model.to(device)
model.eval()
criterion.to(device)
total_loss = 0.0
with torch.no_grad():
gts, preds, scores = [], [], []
for batch in loader:
batch = move_to_device(batch, device)
outputs = model(batch)
loss = criterion(outputs.to(device), batch['labels'])
total_loss += loss.item()
gts.extend(batch['labels'].cpu().numpy())
preds.extend(outputs.cpu().detach().numpy() > 0.5)
scores.extend(outputs.cpu().detach().numpy())
loss = total_loss / len(loader)
f1 = f1_score(gts, preds, average='macro')
per_category_f1 = f1_score(gts, preds, average=None)
return loss, f1, per_category_f1, gts, preds, scores
def train(model: nn.Module, train_loader: DataLoader, criterion: nn.Module, optimizer: torch.optim.Optimizer, device: str, epochs: int, seed: int, val_loader = None, verbose: bool = True):
"""
Train the given model using the specified criterion and optimizer.
Args:
model (nn.Module): The model to train.
train_loader (DataLoader): The data loader for training.
criterion (nn.Module): The criterion used for training.
optimizer (torch.optim.Optimizer): The optimizer used for training.
device (str): The device to run the training on (e.g., 'cuda' or 'cpu').
epochs (int): The number of epochs for training.
seed (int): Random seed for reproducibility.
val_loader (Optional[DataLoader]): The data loader for validation (default: None).
verbose (bool): Whether to print training progress (default: True).
Returns:
Tuple[float, nn.Module]: A tuple containing the best F1 score and the best model.
"""
set_reproducibility(seed)
model.to(device)
criterion.to(device)
best_f1, best_epoch, best_model = -1, None, None
train_losses, train_f1_scores = [], []
val_losses, val_f1_scores = [], []
save_path = os.path.join('checkpoints', f'{model.__class__.__name__}', str(seed))
os.makedirs(save_path, exist_ok=True)
subplots = make_subplots(rows=1, cols=2, subplot_titles=('Loss', 'F1 Score'))
fig = go.FigureWidget(subplots)
fig.update_layout(title_text=f'{model.__class__.__name__} - Seed [{seed}]')
display(fig)
# Training
for epoch in range(epochs):
model.train()
running_loss = 0.0
gts, preds = [], []
tqdm_loader = tqdm(train_loader, desc=f'Epoch {epoch + 1}/{epochs}', leave=False)
for batch_idx, batch in enumerate(tqdm_loader):
batch = move_to_device(batch, device)
# Train step
optimizer.zero_grad()
outputs = model(batch)
loss = criterion(outputs.to(device), batch['labels'])
loss.backward()
optimizer.step()
running_loss += loss.item()
gts.extend(batch['labels'].cpu().numpy())
preds.extend(outputs.cpu().detach().numpy() > 0.5)
tqdm_loader.set_postfix({'loss': running_loss / (batch_idx + 1)})
# Compute F1 score for training set
train_loss = running_loss / len(train_loader)
f1_train = f1_score(gts, preds, average='macro')
train_losses.append(train_loss)
train_f1_scores.append(f1_train)
# Validation
val_loss, f1_val, _, _, _, _ = evaluate(model, val_loader, criterion, device, seed = seed)
val_losses.append(val_loss)
val_f1_scores.append(f1_val)
# Check if current F1 score is the best seen so far
if f1_val > best_f1:
best_f1 = f1_val
best_epoch = epoch + 1
torch.save(model.state_dict(), os.path.join(save_path, 'best_model.pth'))
train_loss_trace = go.Scatter(x=np.arange(len(train_losses)) + 1, y=train_losses, mode='lines+markers', name='Train Loss', line=dict(color='blue'), showlegend=True if epoch==0 else False)
val_loss_trace = go.Scatter(x=np.arange(len(val_losses)) + 1, y=val_losses, mode='lines+markers', name='Validation Loss', line=dict(color='red'), showlegend=True if epoch==0 else False)
train_f1_trace = go.Scatter(x=np.arange(len(train_f1_scores)) + 1, y=train_f1_scores, mode='lines+markers', name='Train F1 Score', line=dict(color='green'), showlegend=True if epoch==0 else False)
val_f1_trace = go.Scatter(x=np.arange(len(val_f1_scores)) + 1, y=val_f1_scores, mode='lines+markers', name='Validation F1 Score', line=dict(color='orange'), showlegend=True if epoch==0 else False)
fig.add_trace(train_loss_trace, row=1, col=1)
fig.add_trace(val_loss_trace, row=1, col=1)
fig.add_trace(train_f1_trace, row=1, col=2)
fig.add_trace(val_f1_trace, row=1, col=2)
# Save data
# Reload the best checkpoint so the returned model matches the reported F1
model.load_state_dict(torch.load(os.path.join(save_path, 'best_model.pth')))
best_model = model
print(f'Saved model with best F1 score ({best_f1:.3f} at epoch {best_epoch}) - {save_path}\n')
return best_f1, best_model
def train_models(model_class: nn.Module, seeds: list, model_name: str, level_3_labels: list, train_loader: DataLoader, val_loader: DataLoader, device: str, class_weights: torch.Tensor = class_weights, epochs: int = 15):
"""
Train multiple models with different random seeds and return the best models.
Args:
model_class (Type[nn.Module]): The class of the model to train.
seeds (List[int]): List of random seeds for reproducibility.
model_name (str): Name of the model.
level_3_labels (List[str]): List of class labels.
train_loader (DataLoader): The data loader for training.
val_loader (DataLoader): The data loader for validation.
device (str): The device to run the training on (e.g., 'cuda' or 'cpu').
class_weights (torch.Tensor): Class weights for training (default: class_weights).
epochs (int): Number of epochs for training (default: 15).
Returns:
Dict[int, Dict[str, Union[float, nn.Module]]]: A dictionary containing the best F1 score and the corresponding best model for each seed.
"""
best_models = {}
best_model = None
for seed in seeds:
best_models[seed] = {}
model = model_class(model_name, len(level_3_labels))
# Freeze layers except for head and dense layers
for name, param in model.named_parameters():
if not ("head" in name or "dense" in name):
param.requires_grad = False
criterion = nn.BCELoss(weight=class_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Train the model
best_f1, best_model = train(model, train_loader, criterion, optimizer, device, epochs=epochs, seed=seed, val_loader=val_loader)
best_models[seed] = {'f1': best_f1, 'model': best_model}
# Calculate and print the average F1 score over all seeds
average_best_f1 = sum(result['f1'] for result in best_models.values()) / len(best_models)
print("Average of best F1 scores:", average_best_f1)
return best_models
seeds = [7, 42, 99]
train_models(CClassifier, seeds, model_name, level_3_labels, train_loader, val_loader, device)
Saved model with best F1 score (0.689 at epoch 14) - checkpoints/CClassifier/7
Saved model with best F1 score (0.673 at epoch 13) - checkpoints/CClassifier/42
Saved model with best F1 score (0.676 at epoch 14) - checkpoints/CClassifier/99 Average of best F1 scores: 0.6792353234220948
train_models(CPClassifier, seeds, model_name, level_3_labels, train_loader, val_loader, device)
Saved model with best F1 score (0.739 at epoch 3) - checkpoints/CPClassifier/7
Saved model with best F1 score (0.728 at epoch 11) - checkpoints/CPClassifier/42
Saved model with best F1 score (0.723 at epoch 12) - checkpoints/CPClassifier/99 Average of best F1 scores: 0.7301600711186227
train_models(CPSClassifier, seeds, model_name, level_3_labels, train_loader, val_loader, device)
Saved model with best F1 score (0.733 at epoch 3) - checkpoints/CPSClassifier/7
Saved model with best F1 score (0.740 at epoch 9) - checkpoints/CPSClassifier/42
Saved model with best F1 score (0.722 at epoch 4) - checkpoints/CPSClassifier/99 Average of best F1 scores: 0.7311993411062522
We will now evaluate the models on the validation set to identify the best-performing variant; the test set is reserved for error analysis.
def evaluation_charts(plots: dict, rows: int = 2, cols: int = 2):
"""
Generate evaluation charts for the given plots.
Args:
plots (dict): A dictionary containing plot titles as keys and plot data as values.
Plot data should be in the format {'labels': [...], 'data': [...]}.
rows (int): Number of rows in the subplot grid.
cols (int): Number of columns in the subplot grid.
Returns:
None
"""
subplots = make_subplots(rows=rows, cols=cols, subplot_titles=[plot['title'] for plot in plots.values()])
fig = go.FigureWidget(subplots)
row, col = 1, 1
for title, plot_data in plots.items():
colors = ['rgb(228, 26, 28)', 'rgb(55, 126, 184)', 'rgb(77, 175, 74)', 'rgb(152, 78, 163)', 'rgb(255, 127, 0)']
data = go.Bar(
x=plot_data['labels'],
y=plot_data['data'],
marker=dict(color=colors)
)
fig.add_trace(data, row=row, col=col)
fig.update_yaxes(title_text="F1 Score", row=row, col=col)
for i, val in enumerate(plot_data['data']):
fig.add_annotation(
x=plot_data['labels'][i],
y=val,
text=str(round(val, 3)),
showarrow=False,
font=dict(color='white', size=11),
xanchor='center',
yanchor='middle',
row=row,
col=col,
yshift=-10
)
col += 1
if col > 2:
col = 1
row += 1
fig.update_layout(title="Evaluation Results", showlegend=False, margin=dict(l=60, r=60, t=60, b=40))
pyo.iplot(fig)
def evaluate_models(model_class, seeds: list, model_name: str, level_3_labels: list, loader: DataLoader, device: str, class_weights=class_weights, rows: int = 2, cols: int = 2):
"""
Evaluate models with given parameters and visualize evaluation results.
Args:
model_class: The class of the model to evaluate.
seeds (list): List of seed values for reproducibility.
model_name (str): Name of the model.
level_3_labels (list): List of level 3 labels.
loader: Data loader for evaluation.
device: Device to run the evaluation on (e.g., 'cuda' or 'cpu').
class_weights: Weights for balancing class distribution.
rows (int): Number of rows in the evaluation charts subplot.
cols (int): Number of columns in the evaluation charts subplot.
Returns:
dict: Dictionary containing evaluation results including the best model, its F1 score, and average F1 scores.
"""
plots, results = {}, {}
best_model = None
best_f1, avg_loss, avg_f1 = -1, 0, 0
avg_per_category_f1 = [0] * len(level_3_labels)
for seed in seeds:
if model_class == RandomUniformClassifier:
model = model_class(len(level_3_labels))
elif model_class == MajorityClassifier:
model = model_class()
model.fit(train_dataset)
else:
model = model_class(model_name, len(level_3_labels))
checkpoint_path = os.path.join('checkpoints', f'{model.__class__.__name__}', str(seed))
model.load_state_dict(torch.load(os.path.join(checkpoint_path, 'best_model.pth'), map_location=device), strict=False)
criterion = nn.BCELoss(weight=class_weights)
# Evaluate the model
_, f1, per_category_f1, _, _, _ = evaluate(model, loader, criterion, device, seed)
# Update best model if current model has higher average F1 score
if f1 > best_f1:
best_model = model
best_f1 = f1
best_seed = seed
avg_f1 += f1
avg_per_category_f1 = [x + y for x, y in zip(avg_per_category_f1, per_category_f1)]
plots[seed] = {'title': f"Seed {seed} Evaluation Results", 'labels': level_3_labels + ["Macro"], 'data': list(per_category_f1) + [f1]}
# Calculate average F1 score, and per-category F1 score
avg_f1 /= len(seeds)
avg_per_category_f1 = [x / len(seeds) for x in avg_per_category_f1]
if rows > 1 or cols > 1:
plots['Average'] = {'title': "Average Evaluation Results", 'labels': level_3_labels + ["Macro"], 'data': list(avg_per_category_f1) + [avg_f1]}
evaluation_charts(plots, rows, cols)
# Prepare results dictionary
results = {
'model_class': model_class.__name__,
'best_model': best_model,
'best_f1': best_f1,
'best_seed': best_seed,
'avg_f1': avg_f1,
'avg_per_category_f1': avg_per_category_f1
}
return results
evaluations_val = []
evaluations_val.append(evaluate_models(CClassifier, seeds, model_name, level_3_labels, val_loader, device))
evaluations_val.append(evaluate_models(CPClassifier, seeds, model_name, level_3_labels, val_loader, device))
evaluations_val.append(evaluate_models(CPSClassifier, seeds, model_name, level_3_labels, val_loader, device))
best_val_model = max(evaluations_val, key=lambda x: x['avg_f1'])
print(f"The best performing model on the validation set is {best_val_model['model_class']}.")
print(f"\tMacro F1 score [{best_val_model['avg_f1']:.3f}]")
print(f"\tBest seed: {best_val_model['best_seed']}.")
The best performing model on the validation set is CPSClassifier. Macro F1 score [0.731] Best seed: 42.
The best performing model is the third one, with a slightly higher average F1 score than the model that does not include Stance among its inputs.
The similar performance to the CP model suggests the gap could be attributed to chance, confirming our intuition that the Stance values, being roughly equally distributed among samples, do not contribute much to discriminative performance.
The ability to correctly predict Openness to Change contributes most to the score differences, followed by Self-Enhancement; these are in fact the less represented classes. We will investigate this further in the following section.
[Task 5 - 1.0 points] Error Analysis¶
You are tasked to discuss your results.
Instructions¶
- Compare classification performance of BERT-based models with respect to baselines.
- Discuss difference in prediction between the best performing BERT-based model and its variants.
Notes¶
You can check the original paper for suggestions on how to perform comparisons (e.g., plots, tables, etc...).
5.1 Best model on test set¶
First of all, we are going to pick the model performing better on the test set.
evaluations_test = []
evaluations_test.append(evaluate_models(CClassifier, seeds, model_name, level_3_labels, test_loader, device))
evaluations_test.append(evaluate_models(CPClassifier, seeds, model_name, level_3_labels, test_loader, device))
evaluations_test.append(evaluate_models(CPSClassifier, seeds, model_name, level_3_labels, test_loader, device))
best_test_model = max(evaluations_test, key=lambda x: x['avg_f1'])
print(f"The best performing model on the test set is {best_test_model['model_class']}.")
print(f"\tMacro F1 score [{best_test_model['avg_f1']:.3f}]")
print(f"\tBest seed: {best_test_model['best_seed']}.")
The best performing model on the test set is CPSClassifier. Macro F1 score [0.698] Best seed: 42.
The classifier leveraging Conclusion, Premise, and Stance performs slightly better on the test set as well. This is not surprising, as the validation and test sets come from the same distribution.
5.2 Comparison with baselines¶
Now let's see how the baselines perform. This time, we will evaluate them using only the best seed, and employ sklearn.metrics.classification_report, sklearn.metrics.precision_recall_curve and sklearn.metrics.average_precision_score.
from sklearn.metrics import classification_report, precision_recall_curve, average_precision_score
random_uniform_result = evaluate_models(RandomUniformClassifier, [best_test_model['best_seed']], model_name, level_3_labels, test_loader, device, rows = 1, cols = 1)
majority_result = evaluate_models(MajorityClassifier, [best_test_model['best_seed']], model_name, level_3_labels, test_loader, device, rows = 1, cols = 1)
def plot_classification_reports(reports: list):
"""
Plot multiple classification reports as subplots with heatmaps.
Args:
reports (list): A list of dictionaries, where each dictionary contains keys 'model_class' and 'report'.
Returns:
None
"""
subplots = make_subplots(rows=1, cols=len(reports), subplot_titles=[report['model_class'] for report in reports])
fig = go.FigureWidget(subplots)
for i, report in enumerate(reports, start=1):
report_dict = report['report']
classes = list(report_dict.keys())
header = list(report_dict[classes[0]].keys())[:-1]
values = [[round(report_dict[class_][metric], 3) for metric in header] for class_ in classes]
fig.add_trace(
go.Heatmap(z=values, x=header, y=classes, colorscale='Magenta', text=values, texttemplate="%{text}", textfont={"size": 12}, hoverinfo='text'),
row=1, col=i
)
fig.update_layout(title='Classification Reports')
fig.update_traces(showscale=False)
fig.show()
# Evaluate BERT model
_, _, _, gts_bert, preds_bert, scores_bert = evaluate(best_test_model['best_model'], test_loader, nn.BCELoss(weight=class_weights), device, best_test_model['best_seed'])
bert_report = classification_report(gts_bert, preds_bert, zero_division=0.0, output_dict=True, target_names=level_3_labels)
# Evaluate Random Uniform Classifier
_, _, _, gts_rand, preds_rand, scores_rand = evaluate(random_uniform_result['best_model'], test_loader, nn.BCELoss(weight=class_weights), device, best_test_model['best_seed'])
random_uniform_report = classification_report(gts_rand, preds_rand, zero_division=0.0, output_dict=True, target_names=level_3_labels)
# Evaluate Majority Classifier
_, _, _, gts_maj, preds_maj, scores_maj = evaluate(majority_result['best_model'], test_loader, nn.BCELoss(weight=class_weights), device, best_test_model['best_seed'])
majority_report = classification_report(gts_maj, preds_maj, zero_division=0.0, output_dict=True, target_names=level_3_labels)
reports_list = [
{'model_class': f"{best_test_model['model_class']}", 'report': bert_report},
{'model_class': f"{random_uniform_result['model_class']}", 'report': random_uniform_report},
{'model_class': f"{majority_result['model_class']}", 'report': majority_report}
]
plot_classification_reports(reports_list)
Here is what we can deduce from this comparison:
The BERT model generally outperforms both the random classifier and the majority classifier in terms of precision, recall, and F1-score.
The random classifier scores noticeably lower on the less represented classes: random predictions coincide with the true positive labels far more often for the frequent classes, so the rare classes accumulate proportionally more errors.
As expected, the majority classifier achieves relatively high recall for the Conversation and Self-transcendence labels, since it always predicts the most represented labels as positive.
Its micro results stress the importance of using macro metrics in tasks like this one; otherwise, the lack of precision on the less represented classes could remain hidden. The macro F1 score is confirmed as the best metric for this comparison, as it is insensitive to class imbalance and treats all classes as equal.
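The point about micro metrics hiding minority-class failures is easy to demonstrate with a toy majority-style predictor: micro F1 stays high because the frequent class dominates the counts, while macro F1 exposes the zero score on the rare class.

```python
import numpy as np
from sklearn.metrics import f1_score

# Two classes: class 0 positive in 9/10 samples, class 1 in only 1/10
y_true = np.array([[1, 0]] * 9 + [[0, 1]])
# Majority-style predictions: always positive for class 0, never for class 1
y_pred = np.array([[1, 0]] * 10)

micro = f1_score(y_true, y_pred, average='micro', zero_division=0)
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
print(micro, macro)  # ~0.90 vs ~0.47
```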
5.3 Precision-recall curves¶
Let us now plot the precision-recall curves. The precision-recall curve shows the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision) as well as returning a majority of all positive results (high recall).
def plot_pr_curves(y_true: list, y_scores: list, model_name: str, class_labels:list, rows: int = 2, cols: int = 2):
"""
Plot Precision-Recall curves for each class using Plotly with subplots.
Args:
y_true (list): True binary labels.
y_scores (list): Target scores, can either be probability estimates of the positive class or confidence values.
model_name (str): Name of the model for plot title.
class_labels (list): List of class labels.
rows (int): Number of rows in the subplots.
cols (int): Number of columns in the subplots.
Returns:
None
"""
subplots = make_subplots(rows=rows, cols=cols, subplot_titles=class_labels)
fig = go.FigureWidget(subplots)
row, col = 1, 1
for i, label in enumerate(class_labels, start=1):
y_class_true = [el[i-1] for el in y_true]
y_class_score = [el[i-1] for el in y_scores]
precision, recall, _ = precision_recall_curve(y_class_true, y_class_score)
auc_score = average_precision_score(y_class_true, y_class_score)
name = f"{label} (AP={auc_score:.2f})"
fig.add_trace(go.Scatter(x=recall, y=precision, name=name, mode='lines'), row=row, col=col)
fig.update_xaxes(title_text='Recall')
fig.update_yaxes(title_text='Precision')
col += 1
if col > 2:
col = 1
row += 1
fig.update_layout(
title=f'Precision-Recall Curves for {model_name}',
hovermode='closest'
)
fig.show()
plot_pr_curves(gts_bert, scores_bert, best_test_model['model_class'], level_3_labels)
Reviewing both precision and recall is valuable when there is an imbalance between classes, such as many instances of class 0 and only a few of class 1, as is the case here. With a large number of class 0 examples, we are less interested in the model's skill at predicting class 0 correctly (i.e., in high true negatives).
The precision-recall curve plots precision (y-axis) against recall (x-axis) for different thresholds. A no-skill classifier predicts a random or constant class; its baseline varies with the proportion of positive samples (0.5 for a balanced dataset).
A model with perfect skill sits at (1,1), while a skilful model curves towards (1,1), staying above the no-skill line. Notice how the curves for the most represented classes, Conversation and Self-transcendence, bend later than those for the other two classes, Openness to change and Self-enhancement.
Composite scores like the F1 score (the harmonic mean of precision and recall) try to summarize this plot in a single number, which is why it is used in tasks like this one.
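The no-skill baseline can also be checked numerically on toy data: with uninformative (constant) scores, average precision collapses to the positive class prevalence, while a perfect ranking reaches 1.

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])  # prevalence = 0.2

# Constant scores carry no ranking information: AP equals the prevalence
ap_no_skill = average_precision_score(y_true, np.full(10, 0.5))
# A perfect ranking puts both positives first: AP = 1
ap_perfect = average_precision_score(y_true, y_true.astype(float))
print(ap_no_skill, ap_perfect)  # 0.2 1.0
```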
5.4 Predictions with different model variants¶
We are now going to examine some predictions and check how models with different inputs behave.
def predict(sample: dict, model: nn.Module, tokenizer, device: str):
"""
Predict labels for an input sample.
Args:
sample (dict): Input sample containing keys 'conclusion', 'premise', 'stance', and 'labels'.
model: The trained model for prediction.
tokenizer: The tokenizer used for tokenization.
device: Device to run the model on (e.g., 'cuda' or 'cpu').
Returns:
tuple: A tuple containing the untokenized conclusion and premise, stance, predicted labels and ground truth labels.
"""
# Decode conclusion and premise back to text
conclusion = tokenizer.decode(sample['conclusion']['input_ids'][0], skip_special_tokens=True)
premise = tokenizer.decode(sample['premise']['input_ids'][0], skip_special_tokens=True)
stance = sample['stance'][0][0]
# Move data to device
sample = move_to_device(sample, device)
# Get model predictions
with torch.no_grad():
outputs = model(sample)
# Decode predicted labels and ground truth labels
label_names = ['Conversation', 'Openess to change', 'Self-enhancement', 'Self-transcendence']
preds = [{f"{label_names[i]}":pred > 0.5} for i, pred in enumerate(outputs.cpu().detach().numpy()[0])]
gts = [{f"{label_names[i]}":gt > 0.5} for i, gt in enumerate(sample['labels'].cpu().numpy()[0])]
return conclusion, premise, stance, preds, gts
single_test_loader = DataLoader(test_dataset, batch_size=1, shuffle=True, worker_init_fn=seed_worker, generator=g)
n_samples = 4
for idx, sample in enumerate(single_test_loader):
print(f"- Sample {idx+1}")
for variant in evaluations_test:
conclusion, premise, stance, preds, labels = predict(sample, variant['best_model'], tokenizer, device)
print(f"\tPredictions {variant['model_class']}:\n\t\t{preds}")
print(f"\tGround truth labels:\n\t\t{labels}")
print(f"\n\tPremise: {premise}")
print(f"\tConclusion: {conclusion}")
print(f"\tStance: {stance}")
print()
if idx == n_samples-1:
break
- Sample 1
Predictions CClassifier:
[{'Conversation': False}, {'Openess to change': False}, {'Self-enhancement': True}, {'Self-transcendence': True}]
Predictions CPClassifier:
[{'Conversation': False}, {'Openess to change': False}, {'Self-enhancement': True}, {'Self-transcendence': True}]
Predictions CPSClassifier:
[{'Conversation': False}, {'Openess to change': True}, {'Self-enhancement': True}, {'Self-transcendence': True}]
Ground truth labels:
[{'Conversation': False}, {'Openess to change': False}, {'Self-enhancement': True}, {'Self-transcendence': False}]
Premise: this camp is too expensive to maintain, especially with such few prisoners.
Conclusion: we should close guantanamo bay detention camp
Stance: 1
- Sample 2
Predictions CClassifier:
[{'Conversation': False}, {'Openess to change': False}, {'Self-enhancement': True}, {'Self-transcendence': True}]
Predictions CPClassifier:
[{'Conversation': False}, {'Openess to change': False}, {'Self-enhancement': True}, {'Self-transcendence': True}]
Predictions CPSClassifier:
[{'Conversation': False}, {'Openess to change': False}, {'Self-enhancement': True}, {'Self-transcendence': True}]
Ground truth labels:
[{'Conversation': False}, {'Openess to change': False}, {'Self-enhancement': True}, {'Self-transcendence': False}]
Premise: the universal declaration of human rights, especially article 14, states :'everyone has the right to seek and to enjoy in other countries asylum from persecution. '
Conclusion: we should create a migration system that reflects european values.
Stance: 1
- Sample 3
Predictions CClassifier:
[{'Conversation': True}, {'Openess to change': True}, {'Self-enhancement': True}, {'Self-transcendence': True}]
Predictions CPClassifier:
[{'Conversation': True}, {'Openess to change': True}, {'Self-enhancement': True}, {'Self-transcendence': True}]
Predictions CPSClassifier:
[{'Conversation': True}, {'Openess to change': True}, {'Self-enhancement': True}, {'Self-transcendence': False}]
Ground truth labels:
[{'Conversation': False}, {'Openess to change': True}, {'Self-enhancement': False}, {'Self-transcendence': True}]
Premise: affirmative action ensures the people from all backgrounds are able to advance.
Conclusion: we should end affirmative action
Stance: 0
- Sample 4
Predictions CClassifier:
[{'Conversation': True}, {'Openess to change': False}, {'Self-enhancement': True}, {'Self-transcendence': True}]
Predictions CPClassifier:
[{'Conversation': False}, {'Openess to change': False}, {'Self-enhancement': True}, {'Self-transcendence': True}]
Predictions CPSClassifier:
[{'Conversation': True}, {'Openess to change': False}, {'Self-enhancement': True}, {'Self-transcendence': True}]
Ground truth labels:
[{'Conversation': False}, {'Openess to change': False}, {'Self-enhancement': False}, {'Self-transcendence': True}]
Premise: equal rights for women makes it obvious that they also have the equal right to be part of our combat forces.
Conclusion: we should prohibit women in combat
Stance: 0
We can notice that the CClassifier generally performs worse than the other two variants, which in turn tend to output the same predictions.
When one of the highly represented classes is positive, the models rarely fail to predict it correctly; this gives us a sense of the additional imbalance between positive and negative samples within each class.
[Task 6 - 1.0 points] Report¶
Wrap up your experiment in a short report (up to 2 pages).
Instructions¶
- Use the NLP course report template.
- Summarize each task in the report following the provided template.
Recommendations¶
The report is not a copy-paste of graphs, tables, and command outputs.
- Summarize classification performance in Table format.
- Do not report command outputs or screenshots.
- Report learning curves in Figure format.
- The error analysis section should summarize your findings.
Submission¶
- Submit your report in PDF format.
- Submit your python notebook.
- Make sure your notebook is well organized, with no temporary code, commented sections, tests, etc...
- You can upload model weights in a cloud repository and report the link in the report.
FAQ¶
Please check these frequently asked questions before contacting us.
Model card¶
You are free to choose the BERT-base model card you like from huggingface.
Model architecture¶
You should not change the architecture of a model (i.e., its layers).
However, you are free to play with their hyper-parameters.
Model Training¶
You are free to choose training hyper-parameters for BERT-based models (e.g., number of epochs, etc...).
Neural Libraries¶
You are free to use any library of your choice to address the assignment (e.g., Keras, Tensorflow, PyTorch, JAX, etc...)
Error Analysis¶
Some topics for discussion include:
- Model performance on most/less frequent classes.
- Precision/Recall curves.
- Confusion matrices.
- Specific misclassified samples.